Identification of essential genes is one of the ultimate goals of drug design.
Here we introduce an in silico method to select essential genes through the
microarray assay. We construct a graph of genes, called the gene transcription
network, based on the Pearson correlation coefficients of the microarray
expression levels. Links are connected between genes in descending order of
the pairwise correlation coefficients. We find that there exist two meaningful
fractions of connected links, p_m and p_s, where the number of clusters
becomes maximum and the connectivity distribution follows a power law,
respectively. Interestingly, one of the clusters at p_m contains a high density
of essential genes having almost the same functionality. Thus the deletion of
all genes belonging to that cluster can efficiently lead to inviable mutants.
Such an essential cluster can be identified in a self-organized way. We first
measure the connectivity of each gene at p_s. Then, using the property that
essential genes are likely to have larger connectivity, we identify the
essential cluster as the cluster at p_m having the largest mean connectivity
per gene.

Thousands of genes and their products in a given living organism are believed
to function in a concerted way that creates the mystery of life [1].
Such cooperative functions between genes can be visualized through a graph
where nodes denote genes and links denote activating or repressing effects on
transcription [2, 3]. Traditional methods in molecular biology are very limited
in their ability to analyze such large-scale interactions among thousands of
genes, so that a global picture of gene functions is hard to obtain. The recent
advent of the microarray assay has attracted researchers, allowing them to
decipher gene interactions in a more efficient way. While the data obtained
through the microarray assay are not yet sufficiently accumulated and are also
susceptible to errors in detecting the expression level, the microarray assay
is a potential candidate for a fundamental approach to understanding
large-scale gene complexes and can be used in many applications such as drug
design and toxicological research.

Since microarray technology is having a significant impact on genomics
research, many methods for pattern interpretation have been developed,
including K-means clustering [3], the self-organizing map [?], the
hierarchical method [?], the relevance network method [?], etc. All such
methods, however, contain tunable thresholds, so that the results obtained
through those methods could be biased by the artificially chosen thresholds.
While those methods are useful for clustering genes, they cannot give any
information needed to identify essential genes. Here the essential genes mean
target genes for drug design, because their deletion makes the organism
inviable.

In this paper, we propose a novel in silico method to identify essential
genes from a microarray database. Our method is inspired by the combination
of gene clustering and the close relationship between the lethality
or essentiality of genes and their connectivity in a network. First, genes are
clustered by using graph theory; then the cluster containing a high density
of essential genes is identified by using the relationship between the
lethality and the connectivity of the graph [?]. The main difference from the
previous work [9] is that, while the previous method mainly deals with genes
individually, our method deals with clusters of genes moduled by their
functionality, which turns out to be much more efficient in selecting essential
genes. In our method, we do not use any artificial threshold. Thus the
essential genes can be identified in a self-organized way. Moreover, we find
that the genes belonging to the same cluster are moduled in their functions.
Since the essential genes we select belong to the same cluster, we can select
the essential genes from almost the same functional module. Finally, we
propose functions for the genes of unknown function in the yeast protein
database classification, taking the major function of the known genes
belonging to the same cluster.



Basic Concepts 

The method is inspired by two concepts: (i) the percolation clustering of
genes moduled by their functions and (ii) the relationship between essentiality
and the inhomogeneous connectivity distribution in biochemical networks.
First, the percolation concept [10] is well known to physicists and has been
applied to many systems, including composite systems of metals and insulators,
which exhibit a transition as the metal concentration p changes. When
p is small enough, there are many small clusters of metal and no giant
cluster spanning from one end to the other, so that the system is in an
insulator phase. As p increases, the number of clusters increases. There
exists a critical value p_c, called the percolation threshold, above which
small clusters are connected together and a giant cluster forms, spanning the
entire system. Then the system turns into a metallic state.

Next, there have recently been many studies of complex systems in terms of
graphs. In graph theory, a graph is composed of nodes and links. The degree of
a node is the number of links connected to that node. The emergence
of a power law in the degree distribution,

$$P(k) \sim k^{-\gamma},$$

in complex networks has recently attracted much attention [?]. A
network following such a power-law degree distribution is called a scale-free
network. Scale-free networks are ubiquitous in social, biological, and
information systems. For example, for the protein interaction
network, where nodes represent proteins and links represent their
interactions, the degree distribution follows a power law [?]. Such behavior
implies that there exist a few hub proteins having a large number of
connections compared with other proteins. Recently it was shown that such hub
proteins are more likely to be essential [9]. For the yeast protein interaction
network, the probability that a protein in the top 0.7% of the degree ranking
is essential is as high as 62%. Thus it was proposed that essential proteins
can be selected by finding highly connected proteins.
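
As a minimal illustration of the quantities defined above, the following Python sketch computes the degree of each node and the empirical degree distribution P(k) from a list of links. The function names and the integer node labels are illustrative assumptions and are not part of the original analysis.

```python
from collections import Counter


def degrees(n_nodes, links):
    """Degree of each node: the number of links attached to it."""
    k = [0] * n_nodes
    for i, j in links:
        k[i] += 1
        k[j] += 1
    return k


def degree_distribution(k):
    """Empirical P(k): fraction of nodes having each observed degree."""
    counts = Counter(k)
    n = len(k)
    return {deg: c / n for deg, c in sorted(counts.items())}


# Toy star graph (hypothetical data): node 0 is a hub linked to nodes 1..3.
star_links = [(0, 1), (0, 2), (0, 3)]
k = degrees(4, star_links)
print(k, degree_distribution(k))
```

In a scale-free network, a plot of this distribution on log-log axes would be approximately a straight line over some range of k.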

Microarray data 

We apply those concepts to the microarray data downloaded from Ref. [?],
containing 287 single-gene deletion S. cerevisiae mutant strains. The deletion
data elucidate generic relationships among perturbed transcriptomes [?].
The data contain two large, internally consistent, global mRNA expression
subsets of the yeast S. cerevisiae. One of them provides steady-state mRNA
expression data in wild-type S. cerevisiae sampled 63 separate times (the
'control' set), and the other provides individual measurements of the genomic
expression program of 287 single-gene deletion mutant S. cerevisiae strains
grown under the same cell culture conditions as the wild-type yeast cells (the
'perturbation' set). Each microarray datum is the ratio between the expression
levels of the wild-type and the perturbed strain. Thus the data can be written
as an N × M matrix denoted as C, where N = 6316 and M = 287 represent the
total number of genes and the number of different deletion experiments,
respectively. Each element C_ij of the matrix C is the logarithm
(base 10) of the ratio of the expression levels for the i-th gene under the
j-th perturbation condition [17].
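
The construction of the matrix C can be sketched as follows. This is only an illustration under stated assumptions: the direction of the ratio (perturbed over wild-type), the use of `None` for missing measurements, and the function name are all choices made here, since the text does not specify them.

```python
import math


def log_ratio_matrix(perturbed, wild_type):
    """Return C with C[i][j] = log10(perturbed[i][j] / wild_type[i][j]).

    Rows are genes, columns are deletion experiments. A missing or
    non-positive measurement yields None (assumed convention).
    """
    C = []
    for p_row, w_row in zip(perturbed, wild_type):
        row = []
        for p, w in zip(p_row, w_row):
            if p is None or w is None or p <= 0 or w <= 0:
                row.append(None)
            else:
                row.append(math.log10(p / w))
        C.append(row)
    return C


# Hypothetical expression levels for 2 genes in 2 experiments.
C = log_ratio_matrix([[10.0, None], [1.0, 4.0]], [[1.0, 2.0], [1.0, 2.0]])
print(C)
```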

Percolation clustering 

To obtain the correlations among the transcribed genes, we compare each
pair of expression levels from different genes. For each pair, we first
select the list of genes whose expression levels are known in both
transcriptomes. Next, the Pearson correlation coefficient ρ_ij between genes
i and j is calculated, defined as

$$\rho_{ij} = \frac{\langle C_{ik} C_{jk} \rangle - \langle C_{ik} \rangle \langle C_{jk} \rangle}{\sqrt{\left(\langle C_{ik}^2 \rangle - \langle C_{ik} \rangle^2\right)\left(\langle C_{jk}^2 \rangle - \langle C_{jk} \rangle^2\right)}},$$

where ⟨···⟩ means the average over k, the different deletion experiments. As
shown in Fig. 1, the distribution of the correlations {ρ_ij} is bell-shaped,
ranging between -1 and 1. Based on the Pearson coefficients, we generate a
network by connecting genes whose Pearson coefficient is larger than a
parameter ρ. That is, a link between nodes i and j is connected if ρ_ij > ρ.
The parameter ρ will be determined later in a self-organized way. Each link is
assumed to have unit weight. Let p be the fraction of connected links
among the N(N − 1)/2 possible pairs. Then p depends on ρ. When p is close to
zero (ρ is close to 1), the number of links is small, and most nodes remain
isolated or form small clusters. As p increases (ρ decreases),
the size of each cluster grows or the number of clusters 𝒩(p) containing at
least two genes increases. At a certain value of p, denoted as p_m, the number
of clusters becomes maximum, as shown in Fig. 2; this value is different from
the percolation threshold p_c. We estimate p_m ≈ 0.0002. Beyond p_m, the
number of clusters decreases, whereas the mean size of each cluster increases.
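
The procedure described in this section can be sketched in Python as follows: compute the Pearson coefficient for every gene pair, add links in descending order of the coefficient, and track the number of clusters containing at least two genes. This is a minimal sketch rather than the original implementation. The union-find bookkeeping, the treatment of missing values as `None`, and all function names are assumptions made for illustration.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson coefficient over experiments where both values are known."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return None
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    cov = sum((a - mx) * (b - my) for a, b in pairs) / n
    vx = sum((a - mx) ** 2 for a, _ in pairs) / n
    vy = sum((b - my) ** 2 for _, b in pairs) / n
    if vx == 0 or vy == 0:
        return None  # coefficient undefined for a constant profile
    return cov / math.sqrt(vx * vy)


def ranked_links(C):
    """All gene pairs (i, j), sorted by decreasing Pearson coefficient."""
    scored = []
    for i, j in combinations(range(len(C)), 2):
        r = pearson(C[i], C[j])
        if r is not None:
            scored.append((r, i, j))
    scored.sort(key=lambda t: -t[0])
    return [(i, j) for _, i, j in scored]


def cluster_count_curve(n_genes, links):
    """Number of clusters with at least two genes after each added link."""
    parent = list(range(n_genes))
    size = [1] * n_genes

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    clusters = 0
    curve = []
    for i, j in links:
        ri, rj = find(i), find(j)
        if ri != rj:
            if size[ri] == 1 and size[rj] == 1:
                clusters += 1  # two isolated genes form a new cluster
            elif size[ri] > 1 and size[rj] > 1:
                clusters -= 1  # two existing clusters merge into one
            # an isolated gene joining a cluster leaves the count unchanged
            if size[ri] < size[rj]:
                ri, rj = rj, ri
            parent[rj] = ri
            size[ri] += size[rj]
        curve.append(clusters)
    return curve


def fraction_at_maximum(n_genes, curve):
    """Fraction p of connected links at which the cluster count first peaks."""
    total_pairs = n_genes * (n_genes - 1) // 2
    best = max(range(len(curve)), key=curve.__getitem__)
    return (best + 1) / total_pairs
```

For N = 6316 genes there are roughly 2 × 10^7 pairs, so the pure-Python pair loop above is meant only to make the steps explicit. A vectorized correlation routine would be preferable in practice.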

Scale-free network 
As p increases, the mean size of each cluster increases. While the giant
cluster forms at p_c, the degree distribution of the giant cluster does not
follow a power law there. The critical state where the degree distribution
follows a power law is reached at a higher fraction, p_s ≈ 0.0063 (Fig. 3).
Note that the degree distribution needs an exponential cutoff in the tail,
which is a generic behavior due to the finite number of genes. The degree
exponent is estimated to be γ ≈ 0.9, which is close to the values obtained by
others in different systems [18, 19], but smaller than the typical values
found in many other systems, in the range 2 < γ < 3. For p > p_s, the
connectivity distribution does not follow a power law.
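
One simple way to estimate a degree exponent from an empirical distribution is a least-squares fit of log P(k) against log k, restricted to degrees below the cutoff region. The sketch below illustrates that generic technique. It is not necessarily the fitting procedure used for Fig. 3, and the function name and the optional `k_max` cutoff parameter are assumptions.

```python
import math


def estimate_degree_exponent(dist, k_max=None):
    """Estimate gamma in P(k) ~ k^(-gamma) by a log-log least-squares fit.

    dist maps degree k -> P(k). Degrees above k_max (if given) are ignored,
    which is a crude way to exclude the exponential cutoff in the tail.
    """
    pts = [(math.log(k), math.log(p)) for k, p in dist.items()
           if k > 0 and p > 0 and (k_max is None or k <= k_max)]
    n = len(pts)
    if n < 2:
        raise ValueError("need at least two points to fit a slope")
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    return -sxy / sxx


# Synthetic check: an exact power law with exponent 0.9 is recovered.
synthetic = {k: k ** -0.9 for k in range(1, 50)}
print(estimate_degree_exponent(synthetic))
```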

To understand the biological implication of the scale-free network, we
investigate the relationship between the degree of a gene and its
essentiality. In Fig. 4, we plot the fraction of essential genes among the
genes with degree larger than k_min. Up to k_min ≈ 250, genes with larger
connectivity are more likely to be essential, but for k_min > 250 this
tendency no longer holds. Even for k_min < 250, the fraction of
essential genes is no larger than 40%, less than the rate of 62% in the
protein interaction network. Thus, as a whole, identifying essential genes
from the connectivity of the gene transcription network
alone is not good enough.
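
The quantity plotted in Fig. 4 can be computed as in the following sketch, which assumes that the degrees are stored in a dictionary keyed by gene name and that the known essential genes are given as a set. Both data structures, the function name, and the toy gene labels are illustrative assumptions.

```python
def essential_fraction_above(degree, essential, k_min):
    """Fraction of essential genes among genes with degree larger than k_min.

    degree: dict gene -> degree; essential: set of essential gene names.
    Returns None when no gene exceeds k_min.
    """
    selected = [g for g, k in degree.items() if k > k_min]
    if not selected:
        return None
    return sum(1 for g in selected if g in essential) / len(selected)


# Hypothetical toy data.
degree = {"a": 5, "b": 3, "c": 1, "d": 4}
essential = {"a", "c"}
for k_min in (0, 2, 4):
    print(k_min, essential_fraction_above(degree, essential, k_min))
```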



Method 

To improve the success rate of identifying essential genes through the
microarray assay, we introduce a new method as follows. First, links are
connected between pairs of genes {i, j} one by one in descending order of
ρ_ij. Whenever a link is connected, we measure the number of clusters 𝒩(p)
containing at least two genes as a function of p. Second, we identify p_m,
where the number of clusters becomes maximum. Third, we find the critical
fraction p_s where the connectivity distribution follows a power law and
measure the connectivity of each gene, k_i(p_s). Finally, keeping the
information on the degree k_i of each gene at p_s, we return to the gene
transcription network at p_m. For each cluster J, we calculate the average
connectivity per node, that is,

$$\langle k^J \rangle = \frac{\sum_{i \in J} k_i(p_s)}{N^J(p_m)},$$

where N^J(p_m) is the number of genes belonging to cluster J. Based on the
fact that genes with larger connectivity are more likely to be
essential, we expect that the cluster with the largest value of ⟨k^J⟩ is the
most likely to contain essential genes.
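
The final step of the method, computing ⟨k^J⟩ for each cluster found at p_m from the degrees measured at p_s, can be sketched as follows. The dictionary-based inputs, the function names, and the toy cluster labels are assumptions made for illustration.

```python
def mean_connectivity(clusters_at_pm, degree_at_ps):
    """Mean degree per gene, <k^J>, for every cluster J.

    clusters_at_pm: dict cluster label -> list of genes in that cluster at p_m.
    degree_at_ps:   dict gene -> degree k_i measured at p_s.
    """
    return {
        label: sum(degree_at_ps[g] for g in genes) / len(genes)
        for label, genes in clusters_at_pm.items()
    }


def pick_essential_cluster(clusters_at_pm, degree_at_ps):
    """Label of the cluster with the largest mean connectivity."""
    means = mean_connectivity(clusters_at_pm, degree_at_ps)
    return max(means, key=means.get)


# Hypothetical toy input: two clusters and the degrees of their genes.
clusters = {"J1": ["a", "b"], "J2": ["c", "d", "e"]}
degree = {"a": 10, "b": 20, "c": 1, "d": 2, "e": 3}
print(mean_connectivity(clusters, degree), pick_essential_cluster(clusters, degree))
```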




Essential cluster 

To confirm this idea, we directly measure the essentiality E^J, that is, the
fraction of known essential genes among the genes belonging to a given cluster
J. Indeed, as shown in Fig. 5, the two quantities, ⟨k^J⟩ and E^J, behave in
the same manner. Thus we can select the cluster containing the largest fraction
of essential genes by finding the cluster with the largest ⟨k^J⟩. We find that
for the yeast data, the third largest cluster, with 64 genes, turns out to have
the largest value of ⟨k^J⟩, containing 47 essential genes, 17 nonessential
genes, and 1 unidentified gene (Fig. 6). Thus the certainty of selecting
essential genes is remarkably improved, to as high as 73%, or even higher when
the unidentified gene is excluded, much larger than that obtained through the
connectivity distribution of the gene transcription network alone.
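
The essentiality of a cluster can be computed as in the sketch below, which also shows the effect of excluding unidentified genes from the denominator. The status labels, the function name, and the toy cluster are assumptions made for illustration, not the yeast data.

```python
def essentiality(genes, status, exclude_unidentified=False):
    """Fraction of known essential genes in a cluster.

    status maps gene -> "essential" or "nonessential"; genes absent from
    status are treated as unidentified (assumed convention).
    """
    n_essential = sum(1 for g in genes if status.get(g) == "essential")
    if exclude_unidentified:
        total = sum(1 for g in genes
                    if status.get(g) in ("essential", "nonessential"))
    else:
        total = len(genes)
    return n_essential / total if total else None


# Hypothetical cluster: 3 essential, 1 nonessential, 1 unidentified gene.
genes = ["a", "b", "c", "d", "e"]
status = {"a": "essential", "b": "essential", "c": "essential",
          "d": "nonessential"}
print(essentiality(genes, status), essentiality(genes, status, True))
```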

Functional clustering 

It is known that many biochemical networks have a modular structure
according to functional role. For yeast, it is known that there
are 43 functional categories [13]. We classify the genes of each cluster at
p_m into these 43 categories. Fig. 7 shows the ratio of genes belonging to
each functional category for the five largest clusters. Fig. 8 also shows the
functional module structure in the gene transcription network. From those
figures, one can see that there is a major function for each cluster,
implying that the genes belonging to the same cluster are likely to have the
same function. For example, the majority of the genes in the largest cluster
belong to the functional class of amino-acid metabolism. Those of the second,
third and fourth largest clusters are small molecule transport, RNA
processing/modification, and protein synthesis, respectively. The reason for
such functional clustering in the gene transcription network is that genes
having the same function are likely to respond to external perturbations in
the same manner, making the Pearson correlation coefficients between them
large. Our result is consistent with the recent discovery of modular
organization in the yeast transcription network [20] and in metabolic
networks [21]. Next, using the fact that genes cluster by functional
module, we assign to each gene of unknown functional annotation, as a function
candidate, the major function of the genes belonging to the same cluster. The
candidates are listed in Table 1.
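
The assignment of function candidates by majority vote within a cluster can be sketched as follows. The representation of annotations as a dictionary with `None` for unknown function, the function name, and the toy gene labels are assumptions made for illustration.

```python
from collections import Counter


def assign_function_candidates(clusters, annotation):
    """Propose, for each unannotated gene, the major function of its cluster.

    clusters:   dict cluster label -> list of genes.
    annotation: dict gene -> functional category, or None if unknown.
    Returns a dict unknown gene -> proposed category.
    """
    proposed = {}
    for genes in clusters.values():
        known = Counter(annotation[g] for g in genes if annotation.get(g))
        if not known:
            continue  # no annotated gene in this cluster to vote
        major = known.most_common(1)[0][0]
        for g in genes:
            if not annotation.get(g):
                proposed[g] = major
    return proposed


# Hypothetical toy cluster with one gene of unknown function.
clusters = {"J1": ["g1", "g2", "g3", "g4"]}
annotation = {"g1": "protein synthesis", "g2": "protein synthesis",
              "g3": "energy generation", "g4": None}
print(assign_function_candidates(clusters, annotation))
```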

Conclusion and discussion 

By using the facts that (i) genes with the same function are highly
correlated in their microarray expression levels and (ii) essential genes are
likely to have larger connectivity in the large-scale gene transcription
network, we have proposed an in silico method to identify a cluster
containing a high density of essential genes. Since the selected genes are from
the same cluster, they are likely to have the same function. These essential
and functionally moduled genes will be useful for drug design. Note that
since our method does not include any tuning parameter, it identifies the
essential cluster without the ambiguity present in previous methods used
for gene clustering. Finally, our work is similar in idea to a recent
proposal that microarray-driven gene expression can be studied much more
efficiently in parallel with the functional analysis of many
gene products [22].

Figure Legends



Fig. 1: The distribution of the correlation functions 

The distribution of the Pearson correlation coefficients ρ_ij for the yeast
S. cerevisiae.

Fig. 2: The number of clusters 

Plot of the number of clusters as a function of the fraction p of connected 
links. 

Fig. 3: The connectivity distribution of the gene transcription network

Plot of the connectivity distribution of the gene transcription network at
various fractions of link connections, p = 0.0003 (□), p = 0.0016 (○),
p = 0.0063 ≈ p_s (•), and p = 0.0032 (▽). At p_s, the degree distribution
follows a power law with an exponential cutoff. The dotted line, with
slope -0.9, is drawn as a guide to the eye.

Fig. 4: The fraction of the essentiality 

Plot of the fraction of essential nodes among the nodes having degree larger
than k_min, as a function of k_min.

Fig. 5: The identification of the essential cluster 

The comparison between ⟨k^J⟩ (•) and E^J (□) for each cluster, indexed by
cluster size at p_m.

Fig. 6: The gene transcription network colored by their essentiality 

The gene transcription network at p_m of the yeast S. cerevisiae. The green,
white and yellow nodes represent essential, nonessential, and unidentified 
genes, respectively. 

Fig. 7: The functional genes ratio for each cluster 
The ratio of genes belonging to each functional category for the five largest
clusters.

Fig. 8: The gene transcription network colored by their functions. 

The gene transcription network at p_m of the yeast S. cerevisiae. Genes
with the functions amino-acid metabolism, small molecule transport, RNA
processing/modification, and protein synthesis are distinguished by the
colors red, blue, green, and brown, respectively. White and yellow
represent other functions and unknown function, respectively.

Table Legend

Table 1: Function candidates for unknown genes 

Functions assigned to unknown genes by following the major function of the
genes of each cluster at p_m.

Figure 6: Rho et al.
Figure 7: Rho et al.
Figure 8: Rho et al.

Table 1: Rho et al.



Amino-acid metabolism or Other metabolism - 52 genes

YAL014C YBR046C YBR047W YBR147W YBR261C YCUYIHW YCL044C YCR051W
YDL054C YDR425W YDR426C YER152C YER175C YFL010C YFL028C YGL117W
YGL224C YHR029C YHR122W YHR162W YIL041W YIL056W YIL164C YIL165C
YJL072C YJL200C YJL213W YJR111C YJR130C YJR154W YLR152C YLR193C
YLR267W YLR290C YLR339C YML113W YMR097C YMR321C YNL129W YNL276C
YNL311C YNR069C YOL118C YOR042W YOR044W YOR203W YPL135W YPL251W
YPL264C YPR059C YPR114W YKL033W-A

Small molecule transport - 48 genes

YAL065C YBR301W YCR104W YDR542W YEL049W YER188W YFL020C YGL261C
YGR150C YGR169C YGR294W YHL046C YHR049W YIL176C YIR041W YJL218W
YJL223C YKL005C YKL224C YLL025W YLL056C YLL064C YLR037C YLR091W
YLR269C YLR461W YMR020W YMR107W YMR252C YMR253C YNL285W YNL310C
YNR014W YNR076W YOL161C YOR134W YOR205C YOR286W YOR389W YOR394W
YPL107W YPL277C YPL282C YPR053C YAL068C YHR049C-A YMR316C-B YMR325W

RNA Processing/modification - 31 genes

YBL028C YCL059C YDL063C YDL148C YDR101C YDR152W YDR165W YDR324C
YDR361C YDR496C YER126C YGR128C YGR145W YHR052W YHR085W YHR196W
YHR197W YKR060W YKR081C YLR068W YLR129W YML093W YNL002C YNL182C
YNL207W YNR053C YOR004W YOL077C YOR145C YPL012W YPLUGC

Protein Synthesis - 5 genes

YGL102C YJL188C YLR062C YPL142C YPR044C

Carbohydrate metabolism or Cell stress - 17 genes

YBR053C YDL110C YDL204W YDR032C YER067W YIL136W YJL070C YJL161W
YLR149C YLR270W YML128C YMR110C YNL115C YNL200C YOL082W YPL123C
YMR169C

Energy generation - 7 genes

YGL069C YKL169C YKL195W YMR158W YPR099C YPR100W YKL053C-A

Chromatin/chromosome structure - 3 genes

YBL113C YFL068W YML133C